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Abstract 

In the Nonnegative Matrix Factorization (NMF) problem we are given an n x to nonnegative 
matrix M and an integer r > 0. Our goal is to express M as AW where A and W are nonnegative 
matrices of size nx r and r xm respectively. In some applications, it makes sense to ask instead 
for the product AW to approximate M - i.e. (approximately) minimize — where 
II 11^ denotes the Frobenius norm; we refer to this as Approximate NMF. 

This problem has a rich history spanning quantum mechanics, probability th(K)ry, data anal- 
ysis, polyhedral combinatorics, communication complexity, demography, chcmometrics, etc. In 
the past decade NMF has become enormously popular in machine learning, where A and W are 
computed using a variety of local search heuristics. Vavasis recently proved that this problem 
is NP-complete. (Without the restriction that A and W be nonnegative, both the exact and 
approximate problems can be solved optimally via the singular value decomposition.) 

We initiate a study of when this problem is solvable in polynomial time. Our results are the 
following: 

1. We give a polynomial-time algorithm for exact and approximate NMF for every constant 
r. Indeed NMF is most interesting in applications precisely when r is small. 

2. We complement this with a hardness result, that if exact NMF can be solved in time 
{nm)°^^\ 3-SAT has a sub-exponential time algorithm. This rules out substantial im- 
provements to the above algorithm. 

3. We give an algorithm that rims in time polynomial in n. m and r under the separablity 
condition identified by Donoho and Stoddcn in 2003. The algorithm may be practical 
since it is simple and noise tolerant (under benign assumptions). Separability is believed 
to hold in many practical settings. 

To the best of our knowledge, this last result is the first example; of a polynomial-time algorithm 
that provably works under a non-trivial condition on the input and we believe that this will be 
an interesting and important direction for future work. 
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1 Introduction 



In the Nonnegative Matrix Factorization (NMF) problem we are given an n x m matrix M with 
nonnegative real entries (such a matrix will be henceforth called "nonnegative") and an integer 
r > 0. Our goal is to express M as AW where A and W are nonnegative matrices of size nx r and 
r xm respectively. We refer to r as the inner- dimension of the factorization and the smallest value 
of r for which there is such a factorization as the nonnegative rank of M. An equivalent formulation 
is that our goal is to write M as the sum of r nonnegative rank-one matrices.^ We note that r must 
be at least the rank of M in order for such a factorization to exist. In some applications, it makes 
sense to instead ask for AW to be a good approximation to M in some suitable matrix norm. We 
refer to the problem of finding a nonnegative A and W of inner-dimension r that (approximately) 
minimizes ||M — as Approximate NMF, where \\\\p denotes the Frobenius norm. Without 

the restriction that A and W be nonnegative, the problem can be solved exactly via singular value 
decomposition [12]. 

NMF is a fundamental problem that has been independently introduced in a number of different 
contexts and applications. Many interesting heuristics and local search algorithms (including the 
familiar Expectation Maximization or EM) have been proposed to find such factorizations. One 
compelling family of applications is data analysis, where a nonnegative factorization is computed 
in order to extract certain latent relationships in the data and has been applied to image segmenta- 
tion [24], [25] information retrieval [16] and document clustering [35]. NMF also has applications in 
fields such as chemometrics [23] (where the problem has a long history of study under the name self 
modeling curve resolution) and biology (e.g. in vision research [7]): in some cases, the underlying 
physical model for a system has natural restrictions that force a corresponding matrix factorization 
to be nonnegative. In demography (see e.g., [15]), NMF is used to model the dynamics of marriage 
through a mechanism similar to the chemical laws of mass action. In combinatorial optimization, 
Yannakakis [37] characterized the number of extra variables needed to succinctly describe a given 
polytope as the nonnegative rank of an appropriate matrix (called the "slack matrix" ) . In commu- 
nication complexity, Aho et al [1] showed that the log of the nonnegative rank of a Boolean matrix 
is polynomially related to its deterministic communication complexity - and hence the famous Log- 
Rank Conjecture of Lovasz and Saks [26] is equivalent to showing a quasi-polynomial relationship 
between real rank and nonnegative rank for Boolean matrices. In complexity theory, Nisan used 
nonnegative rank to prove lower bounds for non-commutative models of computation [ : ]. Addi- 
tionally, the 1993 paper of Cohen and Rothblum [S] gives a long list of other applications in statistics 
and quantum mechanics. That paper also gives an exact algorithm that runs in exponential time. 

Question 1.1. Can a nonnegative matrix factorization be computed efficiently when the inner- 
dimension, r, is small? 

Vavasis recently proved that the NMF problem is A'^P-hard when r is large[36], but this only 
rules out an algorithm whose running time is polynomial in n, m and r. Arguably, in most significant 
applications, r is small. Usually the algorithm designer posits a two-level generative model for the 
data and uses NMF to compute "hidden" variables that explain the data. This explanation is only 
interesting when the number of hidden variables (r) is much smaller than the number of examples 
(m) or the number of observations per example (n). In information retrieval, we often take M to 
be a "term-by-document" matrix where the {i,jY^ entry in M is the frequency of occurrence of 

^It is a common misconception that since the real rank is the maximum number of linearly independent columns, 
the nonnegative rank must be the size of the largest set of columns in which no column can be written as a nonnegative 
combination of the rest. This is false, and has been the source of many incorrect proofs demonstrating a gap between 
rank and nonnegative rank. A correct proof finally follows from the results of Fiorini et al [l J.]. 
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the i term in the j document in the database. In this context, a NMF computes r "topics" 
which are each a distribution on words (corresponding to the r columns of ^4) and each document 
(a column in M) can be expressed as a distribution on topics given by the corresponding column 
of W [16]. This example will be a useful metaphor for thinking about nonnegative factorization. 
In particular it justifies the assertion r should be small - the number of topics should be much 
smaller than the total number of documents in order for this representation to be meaningful. See 
Section A for more details. 

Focusing on applications, and the overwhelming empirical evidence that heuristic algorithms 
do find good-enough factorizations in practice, motivates our next question. 

Question 1.2. Can we design very efficient algorithms for NMF if we make reasonable assumptions 
about M? 

1.1 Our Results 

Here we largely resolve Question 1.1. We give both an algorithm for accomplishing this algorithmic 
task that runs in polynomial time for any constant value of r and we complement this with an 
intractability result which states that assuming the Exponential Time Hypothesis [20] no algorithm 
can solve the exact NMF problem in time {nm)°^^\ 

Theorem 1.3. There is an algorithm for the Exact NMF problem (where r is the target inner- 
dimension) that runs in time 0{{nmy ^'^). 

This result is based on algorithms for deciding the first order theory of the reals - roughly the 
goal is to express the decision question of whether or not the matrix M has nonnegative rank at 
most r as a system of polynomial equations and then to apply algorithms in algebraic geometry to 
determine if this semi-algebraic set is non-empty. The complexity of these procedures is dominated 
by the number of distinct variables occurring in the system of polynomial equations. In fact, the 
number of distinct variables plays an analogous role to VC-dimension, in a sense and the running 
time of algorithms for determining if a semi-algebraic set is non-empty depend exponentially on 
this quantity. Additionally these algorithms can compute successive approximations to a point in 
the set at the cost of an additional factor in the run time that is polynomial in the number of bits 
in the input and output. The naive formulation of the NMF decision problem as a non-emptiness 
problem is to use nr + mr variables, one for each entry in ^4 or 1^ [8]. This would be unacceptable, 
since even for constant values of r, the associated algorithm would run in time exponential in n 
and m. 

At the heart of our algorithm is a structure theorem - based on a novel method for reducing the 
number of variables needed to define the associated semi- algebraic set. We are able to express the 
decision problem for nonnegative matrix factorization using distinct variables (and we make 
use of tools in geometry, such as the notion of a separable partition, to accomplish this [1 I], [2], 
[18]). Thus we obtain the algorithm quoted in the above theorem. All that was known prior to our 
work (for constant values for r) was an exponential time algorithm, and local search heuristics akin 
to the Expectation-Maximization (EM) Algorithm with unproved correctness or running time. 

A natural requirement on A is that its columns be linearly independent. In most applications, 
NMF is used to express a large number of observed variables using a small number of hidden 
variables. If the columns of A are not linearly independent then Radon's Lemma implies that this 
expression can be far from unique. In the example from information retrieval, this translates to: 
there are candidate documents that can be expressed as a convex combination of one set of topics, 
or could alternatively be expressed as a convex combination of an entirely disjoint set of topics (see 
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Section 2.1). When we add the requirement that the columns of A be linearly independent, we refer 
to the associated problem as the Simplicial Factorization (SF) problem. In this case the doubly- 
exponential dependence on r in the previous theorem can be improved to singly-exponential. Our 
algorithm is again based on the first order theory of the reals, but here the system of equations is 
much smaller so in practice one may be able to use heuristic approaches to solve this system (in 
which case, the validity solution can be easily checked). 

Theorem 1.4. There is an algorithm for the Exact SF problem (where r is the target inner- 
dimension) that runs in time 0{{nmY ). 

We complement these algorithms with a fixed parameter intractability result. We make use of 
a recent result of Patrascu and Williams [-iD] (and engineer low-dimensional gadgets inspired by 
the gadgets of Vavasis [ Ui]) to show that under the Exponential Time Hypothesis [ - there is no 
exact algorithm for NMF that runs in time {nrn)°^'^\ This intractability result holds also for the 
SF problem. 

Theorem 1.5. // there is an exact algorithm for the SF problem (or for the NMF problem) that 
runs in time 0((nm)°(^)) then "i-SAT can be solved in 2°^"^ time on instances with n variables. 

Now we turn to Question 1.2. We consider the nonnegative matrix factorization problem under 
the " separability" assumption introduced by Donoho and Stodden [ : ] in the context of image 
segmentation. Roughly, this assumption asserts that there are r rows of A that can be permuted 
to form the identity matrix. If we knew the names of these rows, then computing a nonnegative 
factorization would be easy. The challenge in this context, is to avoid brute- force search (which 
runs in time n^) and to find these rows in time polynomial in n, m and r. To the best of our 
knowledge the following is the first example of a polynomial-time algorithm that provably works 
under a non-trivial condition on the input. 

Theorem 1.6. There is an exact algorithm that can compute a separable, nonnegative factorization 
M = AW (where r is the inner- dimension) in time polynomial in n, m and r if such a factorization 
exists. 

Donoho and Stodden [ ] argue that the separability condition is naturally met in the context 
of image segmentation. Additionally, Donoho and Stodden prove that separability in conjunction 
with some other conditions guarantees that the solution to the NMF problem is unique. Our 
theorem above is an algorithmic counterpart to their results, but requires only separability. Our 
algorithm can also be made noise tolerant, and hence works even when the separability condition 
only holds in an approximate sense. Indeed, an approximate separability condition is regarded as 
a fairly benign assumption and is believed to hold in many practical contexts in machine learning. 
For instance it is usually satisfied by model parameters fitted to various generative models (e.g. 
LDA [ ] in information retrieval). (We thank David Blei for this information.) 

Lastly, we consider the case in which the given matrix M does not have an exact low-rank NMF 
but rather can be approximated by a nonnegative factorization with small inner-dimension. 

Theorem 1.7. There is a 2P°iy(^i°g(V^))poly(n,m) -time algorithm that, given a M for which there 
is a nonnegative factorization AW (of inner- dimension r) which is an e- approximation to M in 
Frobenius norm, computes A' and W' satisfying 

\\M - A'W'Wj, < 0(6^/^1/^) ||M||^ . 



3 



The rest of the paper is organized as follows: In Section 2 we give an exact algorithm for 
the SF problem and in Section 3 we give an exact algorithm for the general NMF problem. In 
Section 4 we prove a fixed parameter intractability result for the SF problem. And in Section 5 and 
Section 6 we give algorithms for the separable and adversarial nonnegative fatorization problems. 
Throughout this paper, we will use the notation that Mi and are the i*^ column and j*^ row 
of M respectively. 

2 Simplicial Factorization 

Here we consider the simplicial factorization problem, in which the target inner- dimension is r 
and the matrix M itself has rank r. Hence in any factorization M = AW (where r is the inner- 
dimension), A must have full column rank and M must have full row rank. 

2.1 Justification for Simplicial Factorization 

We first argue that the extra restriction imposed in simplicial factorization is natural in many 
contexts: Through a re-scaling (see Section ?? for more details), we can assume that the columns of 
M, A and W all have unit norm. The factorization M = AW can be interpreted probabilistically: 
each column of M can be expressed as a convex combination (given by the corresponding column 
of W) of columns in j4. In the example in the introduction, columns of M represent documents 
and the columns of A represent "topics" . Hence a nonnegative factorization is an "explanation" : 
each document can be expressed as a convex combination of the topics. 

But if A does not have full column rank then this explanation is seriously deficient. This follows 
from a restatement of Radon's Lemma. Let conv{Au) be the convex hull of the columns Ai for 
i e U. 

Observation 1. If A is an nxr (with n > r) matrix and rank{A) < r, then there are two disjoint 
sets of columns U,V C [r] so that conv{Au) n conv{Av) / 0. 

The observation implies that there is some candidate document x that can be expressed as 
a convex combination of topics (in U), or instead can be expressed as a convex combination of 
an entirely disjoint set (V) of topics. The end goal of NMF is often to use the representation of 
documents as distributions on topics to perform various tasks, such as clustering or information 
retrieval. But if (even given the set of topics in a database) it is this ambiguous to determine 
how we should represent a given document as a convex combination of topics, then the topics we 
have extracted cannot be very useful for clustering! In fact, it seems unnatural to not require the 
columns of A to be linearly independent! 

Next, one should consider the process (probabilistic, presumably) that generates the datapoints, 
namley, columns of M. Any reasonable process for generating columns of M from the columns of 
A would almost surely result in a matrix M whose rank equals the rank of A. But then M has the 
same rank as A. 

2.2 Algorithm for Simplicial Factorization 

In this Section we give an algorithm that solves the simplicial factorization problem in (nm)*^^^-* 
time. Let L be the maximum bit complexity of any coefficient in the input. 

Theorem 2.1. There is an 0((nm)0('-')) time algorithm for deciding if the simplicial factorization 
problem has a solution of inner- dimension at most r. Furthermore, we can compute a rational 
approximation to the solution up to accuracy 5 in time poly(L, (nm)'-'^^ log 1/(5). 
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The above theorem is proved by using Lemma 2.3 below to reduce the problem of finding a 
simplicial factorization to finding a point inside a semi- algebraic set with poly{n) constraints and 2r^ 
real- valued variables (or deciding that this set is empty) . The decision problem can be solved using 
the well-known algorithm of Basu et. al.[3] solves this problem in vP^"^ ^ time. We can instead use 
the algorithm of Renegar [32] (a bound of poly (L, {nm)^^'^ •*) on the bit complexity of the coefficients 
in the solution due to Grigor'ev and Vorobjov [1.3]) to compute a rational approximation to the 
solution up to accuracy 5 in time poly(L, (nm)'^('' \ log 1/5). 

This reduction uses the fact that since A,W have full rank they have "pseudo- inverses" A'^, 
which are r x n and n x r matrices respectively such that A~^A = WW~^ = Irxr- Thus 
A+Mi = A+AWi = Wi and similarly Mm"^ = AK 

Definition 2.2. Let C = {ui,U2,..nr} be a basis for the columns of M in 3?^^, and let R = 
{vi,V2, ...Vr} be a basis for the rows of M in SR"^. 

Then Mc (a size rxm matrix) denotes the columns of M expressed in the basis C, and similarly 
M/j (a size n x r matrix) denotes the rows of M expressed in the basis 

Lemma 2.3 (Structure Lemma for Simplicial Factorization). M has a simplicial factorization rank 
r iff for every basis C for the columns and basis B for the rows of M , there are r x r matrices 
TcjTr such that: (i) TqMq and MrTh are nonnegative matrices (ii) MrThTcMc = M 

Proof: ("if") Suppose the conditions in the theorem are met. Then set A = M^Tr and W = 
TqMc- These matrices are nonnegative and have size nxr and rxm respectively, and furthermore 
are a factorization for M. Since rank{M) = r, A and W are a simplicial factorization. 

("only if") Conversely suppose that there is a simplicial factorization M = AW. Let C = {, , ..} 
and 3? = {,,...} be arbitrary bases for the columns and rows of M respectively. Let U and V be 
the corresponding nxr and m x r matrices. Let Mq and M/j be r x m and nxr representations 
in this basis for the columns and rows of M - i.e. UMc = M and MjiV^ = M. 

Define r x r matrices Tc = A'^U and Tr = V'^W'^ where A'^ and are the respective 
pseudoinverses of A, W. Let us check that this choice of Tc and Tr satisfies the conditions in the 
theorem. 

We can re-write TcMq = A~^UMc = A'^M = W and hence the first condition in the theorem 
is satisfied. Similarly MjiTr = Mj^V'^W'^ = MW~^ = A and hence the second and third condition 
are also satisfied. ■ 

3 General NMF 

Now we consider the NMF problem where the factor matrices A, W need not have full rank. 

Theorem 3.1. There is a 0{{nmy^^^'') time deterministic algorithm that given an nx m nonneg- 
ative matrix M outputs a factorization AW of inner dimension r if such a factorization exists. 

As in the Simplicial case the main idea will again be a reduction to an existence question for a 
semi- algebraic set, but this reduction is significantly more complicated than Lemma 2.3. 

3.1 General Structure Theorem: Minimality 

Our goal is to re-cast nonnegative matrix factorization (for constant r) as a system of polynomial 
inequalities where the number of variables is constant, the maximum degree is constant and the 
number of constraints is polynomially bounded in n and m. The main obstacle is that A and W 
are large - we cannot afford to introduce a new variable to represent each entry in these matrices. 
We will demonstrate there is always a "minimal" choice for A and W so that: 
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1. there is a collection of linear transformations Ti, T2, ■■■Tg^j.-) from the column-span of M to W 
and a choice function aw '■ ['ti] [gi''')] 

2. and a collection of linear transformations Si, S2, ••■'S'p(r-) from the row-span of M to 3?'' and a 
choice function o"^ : [n] [gir]] 

And these linear transformations and choice functions satisfy the conditions: 

1. for each i G [ra], Wj = T^^(^i)Mi and 

2. for each j e [m], A> = M^S^^^j). 

Furthermore, the number of possible choice functions aw is at most m'^^'^-^^^^ and the number of 
possible choice functions for a a is at most n'^ ^^^^ . These choice functions are based on the notion 
of a simplicial partition, which we introduce later. We then give an algorithm for enumerating all 
simplicial partitions (this is the primary bottleneck in the algorithm). Fixing the choice functions 
aw and aA, the question of finding linear transformations Ti, T2, ...T^^^) and 5i, ^2, ...S'p(^) that 
satisfy the above constraints (and the constraint that M = AW, and A and W are nonnegative) 
is exactly a system of polynomial inequalities with a 0{r'^g{r)) variables (each matrix Tj or Sj is 
r X r), degree at most four and furthermore there are at most 0{mn) polynomial constraints. 

In this subsection, we will give a procedure (which given A and W) generates a "minimal" 
choice for A and W (call this minimal choice A' and W), and we will later establish that this 
"minimal" choice satisfies the structural property stated informally above. 

Definition 3.2. Let Q{A) C 2^ denote the subsets of [r] corresponding to maximal independent 
sets of columns (of A). Similarly let 3?(T^) C 2M denote the subsets of [r] corresponding to maximal 
independent sets of rows (of W) . 

A basic fact from linear algebra is that all maximal independent sets of columns of A have 
exactly rank{A) elements and all maximal independent sets of rows of W similarly have exactly 
rankiW) elements. 

Definition 3.3. Let >-s be the total ordering on subsets of [r] of size s so that if U and V are both 

subsets of [r] of size s,U -<s F iff C/ is lexicographically before V . 

Definition 3.4. Given a column Mj, we will call a subset U G C(^) a minimal basis for Afj (with 
respect to A) if M, G cone{Au) and for all V G Q{A) such that M, G cone{Ay) we must have 
U V. 

Claim 3.5. If Mi G cone{A), then there is some U G S(^) such that Mi G cone^Ajj). 

Definition 3.6. A proper chain {A,W, A' ,W') is a set of nonnegative matrices for which M = 
AW, M = AW' and M = A'W' (the inner dimension of these factorizations is r) and functions 
aw' ■ [m] e{A) and aA' ■ [n] ^{W') such that 

1. for all i G [m], AW^ = Mi, supp{Wl) C aw'ii) and aw'ii) is a minimal basis with respect to 

A for Mi 

2. for all j G [n], A'jW' = M^ , supp{A^) C aA'{j) and aA'{j) is a minimal basis with respect to 
W for M^ . 
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Note that the extra conditions on W' (i.e. the minimal basis constraint) is with respect to A 
and the extra conditions on A' are with respect to W . This simphfies the proof that there is always 
some proper chain, since we can compute a W' that satisfies the above conditions with respect to 
A and then find an A' that satisfies the conditions with respect to W . 

Lemma 3.7. // there is a nonnegative factorization M = AW (of inner- dimension r), then 
there is a choice of nonnegative A' , W of inner- dimension r and functions ay/i : [m\ — )■ Q{A) and 
a A' '■ [n] — )• 01{W') such that (A, W, A' , W') and cJw' j <^A' form a proper chain. 

Proof: The condition that there is some nonnegative W for which M = AW is just the condition 
that for all i G [m], Mj E cone{A). Hence, for each vector Mj, we can choose a minimal basis 
U G QiA) using Claim 3.5. Then Mi G cone{Au) so there is some nonnegative vector W'^ supported 
on U such AWl = Mi and we can set (Jw>{i) = U . Repeating this procedure for each column Mj, 
results in a nonnegative matrix W' that satisfies the condition M = AW' and for each i € [m], by 
design supp{W[) C ow'ii) and (Jy/i{i) is a minimal basis with respect to A for Mj. 

We can re-use this argument above, setting = (W'^)A^ and this interchanges the role of A 
and W . Hence we obtain a nonnegative matrix A' which satisfies M = A'W' and for each j G [n], 
again by design we have that supp{A^) C o'A'ij) and (lA'ij) is a minimal basis with respect to W 
for M^. U 

Definition 3.8. Let Il{A, U) (for U G C(^)) denote the r x n linear transformation that is zero on 
all rows not in U (i.e. n(^, Uy = for j ^ U) and restricted to U is H(A, U)^ = {Au)+ (where 
the + operation denotes the Moore- Penrose pseudoinverse) . 

Lemma 3.9. Let {A,W, A' ,W') and a^i and ga' form a proper chain. For any index i G [rn\, 
let Ui = cr]yi{i) and for any index j G [n] let Vj = aA'ij)- Then Wl = Il{A,Ui)Mi and A'^ = 
Mm{W'^,Vjf. 

Notice that in the above lemma, the linear transformation that recovers the columns of W' is 
based on column subsets of A, while the linear transformation to recover the rows of A' is based 
on the row subsets of W' (not W) . 

Proof: Since {A,W, A' ,W') and ayy and aA' form a proper chain we have that AW' = M. Also 
supp{Wl) C Ui = aw'{i)- Consider the quantity U{A,Ui)Mi. For any j ^ Ui, {U{A,Ui)Mi)j = 0. 
So consider 

{U{A,U^)Mi)u, = {Au,)+AWl = {Au,)+Au„{Wl)u, 

where the last equality follows from the condition supp{Wl) C Ui. Since Ui G C(^) we have that 
(Ajj^)~^Aif^ is the \Ui\ x \Ui\ identity matrix. Hence W^ = Il{A,Ui)Mi. An identical argument with 
W' replaced with A' and with A replaced by W''^ (and i and Ui replaced with j and Vj) respectively 
implies that A'^ = MmiW^ ,Vj)'^ too. ■ 

Note that there are at most |C(A)| < 2^ linear trasformations of the form n(^, Ui) and hence 
the columns of W' can be recovered by a constant number of linear transformations of the column 
span of M, and similarly the rows of A' can also be recovered. 

The remaining technical issue is we need to demonstrate that there are not too many (only 
polynomially many, for constant r) choice functions a]yi and cr^/ and that we can enumerate 
over this set efficiently. In principle, even if say C(^) is just two sets, there are exponentially 
many choices of which (of the two) linear transformation to use for each column of M. However, 
when we use lexicographic ordering to tie break (as in the definition of a minimal basis), the 
number of choice functions is polynomially bounded. We will demonstrate that the choice function 
aw' '■ [fn] — ^ S(^) arising in the definition of a proper chain can be embedded in a restricted type 
of geometric partitioning of M which we call a simplicial partition. 
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3.2 General Structure Theorem: Simplicial Partitions 

Here, we establish that the choice functions and a^' in a proper chain are combinatorially 
simple. The choice function aw can be regarded as a partition of the columns of M into |S(^)| 
sets, and similarly the choice function aA' is a partition of the rows of M into Jl{W') sets. Here 
we define a geometric type of partitioning scheme which we call a simplicial partition, which has 
the property that there are not too many simplicial partitions (by virtue of this class having small 
VC-dimension), and we show that the partition functions aw' and cta' arising in the definition of 
a proper chain are realizable as (small) simplicial partitions. 

Definition 3.10. A (A;, s)-simplicial partition of the columns of M is generated by a collection of 
k sets of s hyperplanes 

^1 = {h\, hi ...h]}, = {hi hi .../i^}, = {hi hi, ...h';}. 

Let Qi = {i' s.t. for all j G [s], hj ■ Mi' > 0}. Then this collection of sets of hyperplanes results in 
the partition 

• Pi = Qi 

• P2 = Q2-Pl 

• Pk = Qk-Pi-P2-.-Pk-i 

. Pk+1 = [m]-Pi-P2...-Pk 

If rank{A) = s, we will be interested in a ( (^) , s)-simplicial partition. 

Lemma 3.11. Let {A,W, A' ,W') and ayyi and ua' form a proper chain. Then the partitions 
corresponding to ay/' and to aA> (of columns and rows of M respectively) are a [{^^ , s)- simplicial 
partition and a {(^^) ,t)- simplicial partition respectively, where rank{A) = s and rank{W') = t. 

Proof: Order the sets in C{A) according to the lexicographic ordering ^g, so that Vi ^2 -<s ■■■Vk 
for k = |S(^)|. Then for each j, let 'K^ be the rows of the matrix (ylv,.)+. Note that there are 
exactly rank{A) = s rows, hence this defines a (/c, s) -simplicial partition. 

Claim 3.12. aw'ii) = j if and only if Mi G Pj in the {k, s) -simplicial partition generated by 
5{1,?{2,...:K^. 

Proof: Since {A,W, A' ,W') and aw' and a^' forms a proper chain, we have that M = AW'. 
Consider a column i and the corresponding set Vi = a\y{i). Recall that Vj is the j*'* set in 
C{A) according to the lexicographic ordering ^g. Also from the definition of a proper chain Vi 
is a minimal basis for Mi with respect to A. Consider any set Vj' £ C(A) with j' < j. Then 
from the definition of a minimal basis we must have that Mi ^ cone{Ay.,). Since Vj' S C{A), we 
have that the transformation {Avj,){Av.,)^ is a projection onto span{A) which contains span{M). 
Hence {Av.,){Av_.,)'^ Mi = Mi, but Mj ^ cone{Av.,) so {Av.,)~^Mi cannot be a nonnegative vector. 
Hence Mj is not in Pji for any j' < j. Furthermore, Mj is in Qj-. using Lemma 3.9 we have 
n(^, Vj)Mi = U{A, Vj)AWl = W[>Q and so (^y.O+Mj = {Ii{A, Vj)Mi)v^ > 0. ■ 

We can repeat the above replacing A with W''^ and W with A' , and this implies the lemma. ■ 
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3.3 Enumerating Simplicial Partitions 

Here we give an algorithm for enumerating all {k, s)-simplicial partitions (of, say, the columns of 
M) that runs in time 0{m^'^^^^^^). An important observation is that the problem of enumerating all 
simplicial partitions can be reduced to enumerating all partitions that arise from a single hyperplane. 
Indeed, we can over-specify a simplicial partition by specifying the partition (of the columns of M) 
that results from each hyperplane in the set of ks total hyperplanes that generates the simplicial 
partition. From this set of partitions, we can recover exactly the simplicial partition. 

A number of results are known in this domain, but surprisingly we are not aware of any algorithm 
that enumerates all partitions of the columns of M (by a single hyperplane) that runs in polynomial 
time (for dim{M) < r and r is constant) without some assumption on M. For example, the VC- 
dimension of a hyperplane in r dimensions is r + 1 and hence the Sauer-Shelah lemma implies that 
there are at most 0{rrf~^^) distinct partitions of the columns of M by a hyperplane. In fact, a 
classic result of Harding (1967) gives a tight upper bound of 0(m'"). Yet these bounds do not 
yield an algorithm for efficiently enumerating this structured set of partitions without checking all 
partitions of the data. 

A recent result of Hwang and Rothblum [ ' '^-'J comes close to our intended application. A sepa- 
rable partition into p parts is a partition of the columns of M into p sets so that the convex hulls 
of these sets are disjoint. Setting p = 2, the number of separable partitions is exactly the number 
of distinct hyperplane partitions. Under the condition that M is in general position (i.e. there 
are no t columns of M lying on a dimension t — 2 subspace where t = rank{M) — 1), Hwang and 
Rothblum give an algorithm for efficiently enumerating all distinct hyperplane partitions [ ]. 

Here we give an improvement on this line of work, by removing any conditions on M (although 
our algorithm will be slightly slower). The idea is to encode each hyperplane partition by a choice 
of not too many data points. To do this, we will define a slight generalization of a hyperplane 
partition that we will call a hyperplane separation: 

Definition 3.13. A hyperplane h defines a mapping (which we call a hyperplane separation) from 
columns of M to {—1,0,1} depending on the sign of h ■ Mi (where the sign function is 1 for positive 
values, —1 for negative values and for zero). 

A hyperplane partition can be regarded as a mapping from columns of M to {—1,1} where we 
adopt the convention that Mj such that h o Mi is mapped to 1. 

Definition 3.14. A hyperplane partition (defined by h) is an extension of a hyperplane separation 
(defined by g) if for all i, g{Mi) ^ ^ g{Mi) = h{Mi). 

Lemma 3.15. Let rank{M) = s, then for any hyperplane partition (defined by h), there is a 
hyperplane g that contains s affinely independent columns of M and for which h (as a partition) is 
an extension of g (as a separation). 

Proof: After an appropriate linear transformation (of the columns of M and the hyperplanes) , we 
can assume that M is full rank. If the h already contains s affinely independent columns of M, 
then we can choose g = h. If not we can perturb h in some direction so that for any column with 
h{Mi) = 0, we maintain the invariant that Mi is contained on the perturbed hyperplane h' . Since 
rank{M) = s this perturbation has non-zero inner product with some column in M and so this 
hyperplane h' will eventually contain a new column from M (without changing the sign of h(Mi) 
for any other column). We can continue this argument until the hyperplane contains s affinely 
independent columns of M and by design on all remaining columns agrees in sign with h. ■ 
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Lemma 3.16. Let rank[M) = s. For any hyperplane h (which defines a partition), there is a 
collection of k < s sets of (at most s) columns of M, 81,82, ■■8k so that any hyperplanes gi,g2, ■■Qk 
which contain Si, 82, ■■■Sk respectively satisfy: For all i, h{Mi) (as a partition) is equal to the value 
of gj{Mi), where j is the smallest index for which gj{Mi) 7^ 0. Furthermore these subsets are 
nested: 81 D 82 ■■■ D 8k^ 

Proof: We can apply Lemma 3.15 repeatedly. When we initially apply the lemma, we obtain a 
hyperplane gi that can be extended (as a separation) to the partition corresponding to /i. In the 
above function (defined implicitly in the lemma) this fixes the partition of the columns except those 
contained in gi. So we can then choose M' to be the columns of M that are contained in gi, and 
recurse. If 82 is the largest set of columns output from the recursive call, we can add columns of 
M contained in gi to this set until we obtain a set of s + 1 affinely independent columns contained 
in gi, and we can output this set (as ^i). ■ 

Theorem 3.17. Let rank[M) = s. There is an algorithm that runs in time 0{m^{s + 2)*) time 
to enumerate all hyperplane partitions of the columns of M . 

Proof: We can apply Lemma 3.16 and instead enumerate the sets of points 81, 82, ■■■Sg^ Since 
these sets are nested, we can enumerate all choices as follows: 

• choose at most s columns corresponding to the set Si 

• initialize an active set T = 8\ 

• until T is empty either 

— choose a column to be removed from the active set 

— or indicate that the current active set represents the next set 8i and choose the sign of 
the corresponding hyperplane 

There are at most 0{m^(s + 2)'^) such choices, and for each choice we can then run a linear 
program to determine if there is a corresponding hyperplane partition. (In fact, all partitions that 
result from the above procedure will indeed correspond to a hyperplane partition) . The correctness 
of this algorithm follows from Lemma 3.16. ■ 

This immediately implies: 

Corollary 3.18. There is an algorithm that runs in time 0{m}"'^^) that enumerates a set of par- 
titions of the columns of M that contains the set of all {k, s)-simplicial partitions (of the columns 
ofM). 

3.4 Solving Systems of Polynomial Inequalities 

The results of Basu et al [3] give an algorithm for finding a point in a semi- algebraic set de- 
fined by 0{mn) constraints on polynomials of total degree at most d, and /(r) variables in time 
0((mnd)^^W). Using our structure theorem for nonnegative matrix factorization, we will re-cast 
the decision problem of whether a nonnegative matrix M has nonnegative rank r as an existence 
question for a semi- algebraic set. 

Theorem 3.19. There is an algorithm for deciding if a n x m nonnegative matrix M has nonneg- 
ative rank r that runs in time 0{{nm)^'^^ ^'^^). Furthermore, we can compute a rational approxi- 
mation to the solution up to accuracy 5 in time poly(L, (nm)'^(^ ^"^^ log 1/(5). 
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We first prove the first part of this theorem using the algorithm of Basu et al [ ] , and we instead use 
the algorithm of Renegar [ ] to compute a rational approximation to the solution up to accuracy 
6 in time poly(L, (nm)'^*-'' ^' log 1/(5). 

Proof: Suppose there is such a factorization. Using Lemma 3.7, there is also a proper chain. We 
can apply Lemma 3.11 and using the algorithm in Theorem 3.17 we can enumerate over a superset 
of simplicial partitions. Hence, at least one of those partitions will result in the choice functions 
cJvK' ^-iid cr^/ in the proper chain decomposition for M = AW. 

Using Lemma 3.9 there is a set of at most 2^ linear transformations Ti,T2, ...T2r which recover 
columns of W given columns of M, and similarly there is a set of at most 2*" linear transformations 
5i, 5*2, ...5'2'- which recover the rows of A' given rows of M. Note that these linear transformations 
are from the column-span and row-span of M respectively, and hence are from subspaces of di- 
mension at most r. So apply a linear transformation to columns of M and one to rows of M to to 
recover matrices A'Iq and Mji respectively (which are no longer necessarily nonnegative) but which 
are dimension r x m and n x r respectively. There will still be a collection of at most 2*" linear 
transformations from columns of Mc to columns of W , and similarly for Mr and A' . 

We will choose variables for each linear transformation, so there are 2 * r-^ * 2*^ variables in 
total. Then we can write a set of m linear constraints to enforce that for each column of {Mc)i, 
the transformation corresponding to a^'ii) recovers a nonnegative vector. Similarly we can define 
a set of n constraints based on rows in Mr. 

Lastly we can define a set of constraints that enforce that we do recover a factorization for 
M: For all i S [m],j £ [n], let i' = aw'{i) and j' = aA'iJ)- Then we write the constraint 
{Mcy Sj'Ti'(Mji)i = Mf. This constraint has degree at two in the variables corresponding to the 
linear transformations. Lemma 3.7 implies that there is some choice of these transformations that 
will satisfy these constraints (when we formulate these constraints using the correct choice functions 
in the proper chain decomposition). Furthermore, any set of transformations that satisfies these 
constraints does define a nonnegative matrix factorization of inner dimension r for M. 

And of course, if there is no inner dimension r nonnegative factorization, then all calls to the 
algorithm of Basu et al [ ] will fail and we can return that there is no such factorization. ■ 

The result in Basu et. al. [ ] is a quantifier elimination algorithm in the Blum, Shub and Smale 
(BSS) model of computation [ ,]. The BSS model is a model for real number computation and it 
is natural to ask what is the bit complexity of finding a rational approximation of the solutions. 
There has been a long line of research on the decision problem for first order theory of reals: given 
a quantified predicate over polynomial inequalities of reals, determine whether it is true or false. 
What we need for our algorithm is actually a special case of this problem: given a set of polynomial 
inequalities over real variables, determine whether there exists a set of values for the variables so that 
all polynomial inequalities are satisfied. In particular, all variables in our problem are quantified by 
existential quantifier and there are no alternations. For this kind of problem Grigor'ev and Vorobjov 
[13] first gave a singly-exponential time algorithm that runs in (n(i)O(^('^)') where n is the number 
of polynomial inequalities, d is the maximum degree of the polynomials and /(r) is the number 
of variables. The bit complexity of the algorithm is poly(L, (nd)'^^^^'^^ ^) where L is the maximum 
length of the coefficients in the input. Moreover, their algorithm also gives an upperbound of 
poly(L, (nd)*^'-'^^''))) on the number of bits required to represent the solutions. Renegar [3'2] gave 
a better algorithm that for the special case we are interested in takes time {nd)^^^^^^\ Using 
his algorithm with binary search (with search range bounded by Grigor'ev et.al.[l 'i]), we can find 
rational approximations to the solutions with accuracy up to 6 in time poly(L, (nm)*^'-'^''^-*-', log 1/d). 

We note that our results on the SF problem are actually a special case of the theorem above 
(because our structural lemma for simplicial factorization is a special case of our general structure 
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theorem): 

Corollary 3.20. There is an algorithm for determining whether the positive rank of a nonnegative 
n X m matrix M equals the rank and this algorithm runs in time 0({nmY^ ). 

Proof: If rank{M) = r, then we know that both A and W must be full rank. Hence C(^) and 
'Jl{W) are both just the set {1, 2, ...r}. Hence we can circumvent the simplicial partition machinery, 
and set up a system of polynomial constraints in at most 2r^ variables. ■ 

4 Strong Intractability of Simplicial Factorization 

Here we give evidence that finding a simplicial factorization of dimension r probably cannot be 
solved in (nm)"^^^ time, unless 3-SAT can be solved in 2°^"^ time (in other words, if the Exponential 
Time Hypothesis of [- ] is true). Surprisingly, even the A^P-hardness of the problem for general 
r was only proved quite recently by Vavasis [36]. That reduction is the inspiration for our result, 
though unfortunately we were unable to use it directly to get low-dimensional instances. Instead 
we give a new reduction using the d-SUM Problem. 

Definition 4.1 (d-SUM). In the d-SUM problem we are given a set of N values {si, S2, ...stv} 
each in the range [0, 1], and the goal is to determine if there is a set of d numbers (not necessarily 
distinct) that sum to exactly d/2. 

This definition for the d-SUM Problem is slightly unconventional in that here we allow repetition 
(i.e. the choice of d numbers need not be distinct). Patrascu and Williams [30] recently proved 
that if d-SUM can be solved in A°('^) time then 3-SAT has a sub-exponential time algorithm. In 
fact, in the instances constructed in [30] we can allow repetition of numbers without affecting 
the reduction since in these instances choosing any number more than once will never result in a 
sum that is exactly d/2. Hence we can re-state the results in [30] for our (slightly unconventional 
definition for) d-SUM. 

Theorem 4.2. If d < _/v0.99 if d- SUM instances of N distinct numbers each of 0{dlogN) bits 
can be solved in N°^'^^ time then 3-SAT on n variables can be solved in time 2°("\ 

Given an instance of the d-SUM, we will reduce to an instance of the Intermediate Simplex 
problem defined in [36]. 

Definition 4.3 (Intermediate Simplex). Given a polyhedron P = {x G : Hx > b} where H 

is an n X (r — 1) size matrix and b G 3?" such that the matrix [H, b] has rank r and a set S oi m 
points in the goal of the Intermediate Simplex Problem is to find a set of points T that form 

a simplex (i.e. T is a set of r affinely independent points) each in P such that the convex hull of T 
contains the points in 5. 

Vavasis [ ] proved that Intermediate Simplex is equivalent to the Simplicial Factorization 
problem. 

Theorem 4.4 (Vavasis, 2009 [■)'■]). There is a polynomial time reduction from Intermediate Simplex 
problem to Simplicial Factorization problem and vice versa and furthermore both reductions preserve 
the value of r. 

Interestingly, an immediate consequence of this theorem is that Simplicial Factorization is easy 
in the case in which rank{M) = 2 because mapping these instances to instances of intermediate 
simplex results in a one dimensional problem - i.e. the polyhedron P is an interval. 
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Figure 1: The Gadget 



4.1 The Gadget 

Given the universe U = {si, S2, • • • , sn} for the d-SUM problem, we construct a two dimensional 
Intermediate Simplex instance as shown in Figure 1. We will show that the Intermediate Simplex 
instance has exactly N solutions, each representing a choice of Sj. Later in the reduction we use d 
such gadgets to represent the choice of d numbers in the set U. 

Recall for a two dimensional Intermediate Simplex problem, the input consists of a polygon 
T (which is the hexagon ABCDEF in Figure 1) and a set of points S = {/i,l2, . . . iI^n} inside 
y (which are the dots, except for M). A solution to this two dimensional Intermediate Simplex 
instance will be a triangle inside J" such that all the points in S are contained in the triangle (in 
Figure 1 ACE is a valid solution). 

We first specify the polygon CP for the Intermediate Simplex instance. The polygon 7 is just 
the hexagon ABCDEF inscribed in a circle with center M. All angles in the hexagon are 27r/3, 
the edges AB = CD = EE = e where e is a small constant depending on N ^ d that we determine 
later. The other 3 edges also have equal lengths BC = DE = FA. 

We use y{A) and z{A) to denote the y and z coordinates for the point A (and similarly for all 
other points in the gadget). The hexagon is placed so that y{A) = y{B) = 0, y{D) = y{E) = 1. 

Now we specify the set S of 3A^ points for the Intermediate Simplex instance. To get these 
points first take points in each of the 3 segements AB, CD, EE. On AB these A^ points are 
called Ai, A2, An, and \AAi\ = esi. Similarly we have points Cj's on CD and -Ej's on EE, 
\CCi\ = \EEi\ = esi. Now we have A^ triangles AidEi (the thin lines in Figure 1). We claim (see 
Lemma 4.5 below) that the intersection of these triangles is a polygon with 3A^ vertices. The points 
in S are just the vertices of this intersection. 

Lemma 4.5. When e < 1/50, the points {Ai}, {Ci}, {Ei} are on AB, CD, EE respectively and 
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Figure 2: Proof of Lemma 4.5 

AAi = CCi = EEi = esi, the intersection of the N triangles {AiCiEi} is a polygon with 3N 
vertices. 

Proof: Since the intersection of N triangles AiCiEi is the intersection of 3N halfplanes, it has 
at most 3A^ vertices. Therefore we only need to prove every edge in the triangles has a segment 
remaining in the intersection. Notice that the gadget is symmetric with respect to rotations of 
27r/3 around the center M. By symmetry we only need to look at edges AiCi. The situation here 
is illustrated in Figure 2. 

Since all the halfplanes that come from triangles AiCiEi contain the center M, later when 
talking about halfplanes we will only specify the boundary line. For example, the halfplane with 
boundary AiCi and contains Ei (as well as M) is called halfplane AiCi. 

The two thick lines in Figure 2 are extensions of AB and CD, now they are rotated so that they 
are z = it-v/Sy. The two thin lines are two possible lines AiCi and AjCj. The differences between 
y coordinates of Ai and Ci are the same for all i (here normalized to 1) by the construction of 
the points Ai^s and Cj's. Assume the coordinates for Ai, Aj are {yi, —V^yi) and {yj, — V^yj) 
respectively. Then the coordinates for the intersection is {yi + yj + 1, \/3(l + yi + yj + 2yiyj)). This 
means if we have segments with yi < y2 < ■ ■ ■ < yN, segment i will be the highest one when 
y is in range {yi-i + yi + l,yi + yi+i + 1) (indeed, the lines with j > i have higher slope and will 
win when y > yi + yj + 1 > yi + yi+i + 1; the lines with j < i have lower slope and will win when 

y <yi + yj + i<yi + yi-i + i)- 

We also want to make sure that all these intersection points are inside the halfplanes CiEiS and 
EiAiS. Since e < 1/50, all the y^'s are within [—1/2 — 1/20, —1/2 + 1/20]. Hence the intersection 
point is always close to the point (0, -v/3/2), the distance is at most 1/5. At the same time, since 
e is small, the distances of this point (0, \/3/2) to all the Cj-Ej's and EiAiS are all larger than 
1/4. Therefore all the intersection points are inside the other 2A^ halfplanes and the segments will 
indeed remain in the intersection. The intersection has 3A edges and 3A vertices. ■ 
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Figure 3: Proof of Lemma 4.6 



The Intermediate Simplex instance has N obvious solutions: the triangles AidEi, each one 
corresponds to a value Si for the d-SUM problem. In the following Lemma we show that these are 
the only possible solutions. 

Lemma 4.6. When e < 1/1000, if the solution of the Intermediate Simplex problem is PQR, then 
PQR must he one of the AiCiEi 's. 

Proof: Suppose PQR is a solution of the Intermediate Simplex problem, since M is in the convex 
huh of {Ii, I2, . . . , /sat}, it must be in PQR. Thus one of the angles ZPMQ, ZQMR, ZRMP must 
be at least 2tt/3 (their sum is 27r). Without loss of generality we assume this angle is ZPMQ and 
by symmetry assume P is either on AB or BC. We shall show in either of the two cases, when P 
is not one of the ^j's, there will be some 1^ that is not in the halfplane PQ (recall the halfplanes 
we are interested in always contain M so we don't specify the direction). 

When P is on AB, since ZPMQ > 27r/3, we have CQ > AP (by symmetry when CQ = AP the 
angle is exactly 27r/3). This means we can move Q to Q' such that CQ' = AP. The intersection of 
halfplane PQ' and the hexagon ABCDEF is at least as large as the intersection of halfplane PQ 
and the hexagon. However, if P is not any of the points {Ai} (that is, \PQ'\/e {si,S2, ...,SAr}), 
then PQ' can be viewed as ^at+iCat+i if we add sn+i = \AP\/e to the set U. By Lemma 4.5 
introducing PQ' must increase the number of vertices. One of the original vertices Ik is not in the 
hyperplane PQ' , and hence not in PQR. Therefore when P is on AB it must coincide with one of 
the AiS, by symmetry PQR must be one of AiCiEiS. 

When P is on BC, there are two cases as shown in Figure 3. 

First observe that if we take U' = C/U{1 — si,l — S2,...,l — sat}, and generate the set 
S = {/i, I2, . . . , Iqn} according to U' , then the gadget is further symmetric with respect to flipping 
along the perpendicular bisector of BC. Now without loss of generality BP < BC/2. Since every 
Ik is now in the intersection of 2N triangles, in particular they are also in the intersection of the 
original triangles, it suffices to show one of Ik {k G [6A^]) is outside halfplane PQ. 

The first case (left part of Figure 3) is when BP < e. In this case we extend PQ to get 
intersection on AB (P') and intersection on CD {Q'). Again since ZPMQ > 2tt/2>, we have 
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DQ > BP. At the same time we know ZDQQ' > ZP'PB, so DQ' > BP'. Similar to the 
previous case, we take Q" so that CQ" = AP' . The intersection of hyperplane P'Q" and the 
hexagon ABCDEF is at least as large as the intersection of halfplane PQ and the hexagon. When 
e < 1/1000, we can check AP' < 2e <C 1/50, therefore we can still view P'Q" as some A2N+1C2N+1 
for S2N+1 < 2. Now Lemma 4.5 shows there is some vertex 1^ not in halfplane P'Q" (and hence 
not in halfplane PQ). 

The final case (right part of Figure 3) is when BP > e. In this case we notice the triangle with 
3 edges AD, BE., CF (the shaded triangle in the figure) is contained in every AiCiEi, thus it must 
also be in PQR. However, since BC/2 > BP > e, we know AR < e and DQ < e. In this case 
PQR does not even contain the center M. ■ 



4.2 The Reduction 

Suppose we are given an instance of the d-SUM Problem with values {si, S2, ...sat}. We will give 
a reduction to an instance of Intermediate Simplex in dimension r — 1 = 3d + 1. 

To encode the choice of d numbers in the set {si,S2, ...,SAr}, we use d gadgets defined in Sec- 
tion 4.1. The final solution of the Intermediate Simplex instance we constructed will include solu- 
tions to each gadget. As the solution of a gadget always corresponds to a number in {si, S2, sat} 
(Lemma 4.6) we can decode the solution and get d numbers, and we use an extra dimension w that 
"computes" the sum of these numbers and ensures the sum is equal to d/2. 

We use three variables {xi,yi,Zi} for the i^^ gadget. 

Variables 1. We will use 3d + 1 variables: sets {xi, yi,Zi} for i G [d\ and w. 

Constraints 1 (Box). For all i € [d], Xj, G [0, 1], G [0, 2] and also w G [0, 1]. 

Definition 4.7. Let G C?R.^ he the hexagon ABCDEF in the two-dimensional gadget given in the 
Section 4.1. Let H C ?fi^ he the set conv{{{xi,yi, Zi) G ?R.^\{yi,Zi) £ G,Xi = 1},0). 

H is a tilted-cone that has a hexagonal base G and has an apex at the origin. 

Definition 4.8. Let i? be a 7 x 3 matrix and 6 G so that {x\Rx > b} = H. 

We will use these gadgets to define (some of the) constraints on the polyhedron P in an instance 
of intermediate simplex: 

Constraints 2 (Gadget). For each i G [d], R{xi,yi,Zi) > b. 

Hence when restricted to dimensions Xj, yi, Zi the i*'* gadget G is on the plane Xi = 1. 

We hope that in a gadget, if we choose three points corresponding to the triangle for some value 
Si, that of these three points only the point on the AB line will have a non-zero value for w and 
that this value will be Sj. The points on the lines GD or EE will hopefully have a value close to 
zero. We add constraints to enforce these conditions: 

Constraints 3 (CE). For all i G [d], w < 1 — yj + (1 — Xi) 

These constraints make sure that points on CD or EF cannot have large w value. 
Recall that we use z{A) to denote the z coordinate of A in the gadget in Section 4.1. 

Constraints 4 (AB). For all i e [d]: w e f I^iHiMM ± (12 y. + (1 - Xi)) 
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Theses constraints make sure that points on AB have values in {si, S2, sat}- 

The AB and CE constraints all have the property that when Xi < 1 (i.e. the corresponding 

point is off of the gadget on the plane Xi = 1) then these constraints gradually become relaxed. 
To make sure the gadget still works, we don't want the extra constraints on w to rule out some 

possible values for Xi, yi, Zj's. Indeed we show the following claim. 

Claim 4.9. For all points in {xi, yi, Zi) G H , there is some choice of w G [0, 1] so that xi, yi, Zi and 
w satisfy the CE and AB Constraints. 

The proof is by observing that Constraints AB have almost no effect when y > and Constraints 
CE have no effect when y = 0. 

Constraints 1 to 4 define a polyhedron P 'm3d+ 1-dimensional space and furthermore the set of 
constraints that define P have full rank (in fact even the inequalities in the Box Constraints have 
full rank). Thus this polyhedron is a valid polyhedron for the Intermediate Simplex problem. 

Next we specify the points in S for the Intermediate Simplex problem(each of which will be 
contained in the polyhedron P). Let Ik (for k S [3A^]) be the set S in the gadget in Section 4.1. 
As before, let z{Ik) and y{Ik) be the z and y coordinates of respectively. 

Definition 4.10 {w-max{Ik)). Let w-inax{Ik) be the maximum possible w-value of any point I 
with Xi = 1, yi = y{Ik), Zi = z{Ik) and Xj,yj, zj = for all j ^ i so that / is still contained in P. 

Definition 4.11 (O, W, I^, Q). The set S of points for the Intermediate Simplex problem is 

O point: For all i £ [d], Xi, yi, Zi = and w = 

W point: For all i £ [d], Xi,yi, Zi = and w = 1 

points: For each i £ [d], for each k G [3A^] set Xi = 1/4, yi = l/4y(/fc), Zi = l/4z{Ik) and for 
j ^ i set Xj,yj, Zj = 0. Also set w to be the 1/4 x i(;-max(/fc). 

Q point: For each i £ [d], Xi = 1/d, yi = y{M)/d, Zi = z{M)/d and w = 1/6 

This completes the reduction of 3-SUM to intermediate simplex, and next we establish the 
COMPLETENESS and SOUNDNESS of this reduction. 

4.3 Completeness and Soundness 

The completeness part is straight forward: for i*'' gadget we just select the triangle that corresponds 

to Sfe-. 

Lemma 4.12. If there is a set {s^j^, Sk2, ■■■Sk^} of d values (not necessarily distinct) such that 
J2ie[d] = d/2 then there is a solution to the corresponding Intermediate Simplex Problem. 

Proof: We will choose a set of 3d + 2 points T: We will include the O and W points, and for each 
Sfc. , we will choose the triangle corresponding to the value s^. in the i^^ gadget. Recall the triangle 
is Afc.Cfc.E'fc. in the gadget defined in Section 4.1. The points we choose have Xi = 1 and yi, Zi 
equal to the corresponding point in the gadget. We will set w to be s/j^ for the point on the line 
AB and we will set w to be zero for the other two points not contained in the line AB. The rest 
of the dimensions are all set to 0. 

Next we prove that the convex hull of this set of points T contains all the points in 5: The points 
O and W are clearly contained in the convex hull of T (and are in fact in T!). Next consider some 
point 11 in S corresponding to some intersection point I^ in the gadget G. Since Ik is in the convex 
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hull of the triangle corresponding to Sfc- in the gadget G, there is a convex combination of the these 
three points A/^. , C^. , E'fc^ in T (which we call J) so that 1 /4 J matches /|, on all coordinates except 
possibly the ^i;-coordinate. Furthermore the point J has some value in the coordinate corresponding 
to w and this must be at most the corresponding value in (because we chose the w-value in 
to be 1/4 X w-max(Ik)). Hence we can distribute the remaining 3/4 weight among the O and W 
points to recover exactly on all coordinates. 

Lastly, we observe that if we equally weight all points in T (except O and W) we recover the 
point Q. In particular, the w coordinate of Q should be ^ Yli=i Sfe^ = 1/6- • 

Next we prove SOUNDNESS for our reduction. Suppose the solution is T, which is a set of 
3d + 2 points in the polyhedron P and the convex hull of points in T contains all the O, W, Q 
points (in Definition 4.11). 

Claim 4.13. The points O and W must he in the set T. 

Proof: The points O and W are vertices of the polyhedron P and hence cannot be expressed as a 
convex combination of any other set of points in P. ■ 

Now we want to prove the rest of the 3d points in set T is partitioned into d triples, each triple 
belongs to one gadget. Set T' = T - {0} - {W}. 

Definition 4.14. For i £ [d], let 

T/ = {Z £ T'\j ^ i =^ Xj{Z),yj{Z),Zj{Z) = and one of Xi{Z),yi{Z), Zi{Z) / 0} 
Claim 4.15. The sets T- partition T' and each contain exactly 3 nodes. 

Proof: The sets T/ are disjoint, and additionally each set must contain at least 3 nodes (otherwise 
the convex hull of T- even restricted to Xi,yi,Zi cannot contain the points I^). This implies the 
Claim. ■ 

Recall the gadget in Section 4.1 is a two dimensional object, but it is represented as a three 
dimensional cone in our construction. We would like to apply Lemma 4.6 to points on the plane 
Xi = 1 (in this plane the coordinates yi,Zi act the same as y, z in the gadget). 

Definition 4.16. For each point Z £ T[, let ext{Z) £ be the intersection of the line connecting 
the origin and {xi{Z),yi{Z), Zi{Z)) with the Xi = 1 base of the set {{xi,yi, Zi)\R{xi,yi, Zi) > b}. 
Let ext(Tl) be the point- wise ext operation applied to each point in T/. 

Since the points are in the affine hull of T- when restricted to Xi,yi,Zi , we know ext(/^) 
must be in the convex hull of ext{T-). Using Lemma 4.6 in Section 4.1, we get: 

Corollary 4.17. ext(T-) must correspond to some triangle Aj^-C^iEk. for some value s^. . 

Now we know how to decode the solution T and get the numbers Sk^■ We will abuse notation 
and call the 3 points in T/ A].. , Cfc- , E^. (they were used to denote the corresponding points in the 
2-d gadget in Section 4.1). We still want to make sure the w coordinate correctly "computes" the 
sum of these numbers. As a first step we want to show that the Xi of all points in T'^ must be 1 
(we need this because the Constraints AB and CE are only strict when Xi = 1). 

Lemma 4.18. For each point Z £ T/, Xi{Z) = 1 
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Proof: Suppose, for the sake of contradiction, that Xi{Z) < 1 (for Z G T/). Then consider the point 
Q. Since X^jg[(j] Xi{Q) = 1, and for any point in T Ylii^id] ^ li there is no convex combination of 
points in T that places non-zero weight on Z and equals Q. 

Let T'-' be Tj'\{Z}, we observe that the points in T[' are the only points in T that have any 
contribution to {xi,yi,Zi) when we want to represent Q (using a convex combination). For now 
we restrict our attention to these three dimensions. When trying to represent Q we must have 1/d 
weight in the set T/' (because of the contribution in Xi coordinate). The yi, Zi coordinates of Q 
are y{M)/d, z{M)/d respectively. This means if we take projection to yi,Zi plane M must be in 
the convex hull of T". However that is impossible because no two points in A^CkEj, contain M in 
their convex hull. This contradiction implies the Lemma. ■ 

Lemma 4.19. Any convex combination of points in T that equals the point Q must place equal 
weight on all points in T' . 

Proof: Using Lemma 4.18, we conclude that the total weight on points in T'^ is exactly l/d, and 
there is a unique convex combination of the points T[ (restricted to yi,Zi) that recover the point 
M which is the 1/3, 1/3, 1/3 combination. This implies the Lemma. ■ 

Now we are ready to compute the w value of the point Q and show the sum of s^. is indeed 
d/2. 

Lemma 4.20 (Soundness). When e < N^^'^ for some large enough constant C, if there is a 
solution to the Intermediate Simplex instance, then there is a choice of d values that sum up to 
exactly d/2. 

Proof: As we showed in previous Lemmas, the solution to the Intermediate Simplex problem must 
contain O, W , and for each gadget i the solution has 3 points T/ that correspond to one of the 
solutions of the gadget. Suppose for gadget i the triangle we choose is Af^-Ck^E^^. By Constraints 
AB we know tt;(Afc.) = Sk^, by Constraints CE we know w{Cki) < e and wiEk^) < e. 

By Lemma 4.19 there is only one way to represent Q, and w{Q) = ^ X^^Li [""^(^fcj + ^(C'fcJ + 
w{Eu,)] = 1/6. 

d d , d 

^Sk,=Y,HAk,) = ^-Y.i'^iCkJ + wiEkJ]. (1) 

i=l 1=1 i=l 

Since w{Ck^) and w{EkJ's are small, we have Yli=i G ['^/2 — 2de, d/2]. However the numbers 
only have 0{d\ogN) bits and e is so small, the only valid value in the range is d/2. Hence the sum 
Z^iLi Sfcj must be equal to d/2. ■ 

5 Fully-EfRcient Factorization under Separability 

Earlier, we gave algorithms for NMF, and presented evidence that no {nm)"^'^'^ time algorithm 
exists for determining if a matrix M has nonnegative rank at most r. Here we consider conditions 
on the input that allow the factorization to be found in time polynomial in n, m and r. (In 
Section 5.1, we give a noise-tolerant version of this algorithm). To the best of our knowledge this 
is the first example of an algorithm (that runs in time poly(n, m,r)) and provably works under 
a non-trivial condition on the input. Donoho and Stodden [10] in a widely-cited paper identified 
sufficient conditions for the factorization to be unique (motivated by applications of NMF to a 
database of images) but gave no algorithm for this task. We give an algorithm that runs in time 
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poly(n, m,r) and assumes only one of their conditions is met (separability). We note that this 
separability condition is quite natural in its own right, since it is usually satisfied [4] by model 
parameters fitted to various generative models (e.g. LDA ["] in information retrieval). 

Definition 5.1 (Separability). A nonnegative factorization M = AW is called separable if for each 
i there is some row f{i) of A that has a single nonzero entry and this entry is in the i*^ column. 

Let us understand this condition at an intuitive level in context of clustering documents by 
topic, which was discussed in the introduction. Recall that there a column of M corresponds to 
a document. Each column of A represents a topic and its entries specify the probability that a 
word occurs in that topic. The NMF thus "explains" the i*^ document as AWi where the column 
vector Wi has (nonnegative) coordinates summing to one — in other words, Wi represents a convex 
combination of topics. In practice, the total number of words n may number in the thousands 
or tens of thousands, and the number of topics in the dozens. Thus it is not unusual to find 
factorizations in which each topic is fiagged by a word that appears only in that topic and not in 
the other topics [1]. The separability condition asserts that this happens for every topic^. 

For simplicity we assume without loss of generality that the rows of M are normalized to have 
unit £i-norm. After normalizing M, we can still normalize W (while preserving the factorization) 
by re- writing the factorization as M = AW = {AD){D~'^W) for some r x r nonnegative matrix D. 
By setting Di^i = the rows of D~^W will all have li norm 1. When rows of M and W are 

all normalized the rows of A must also have unit £i-norm because 

r r r 

i = ||M^IL= 5]A,,Ty^- =J2A,j\\w^\^ = Y,Aij- 

i=i 1 i=i i=i 

The third equality uses the nonnegativity of W. Notice that after this normalization, if a row 
of A has a unique nonzero entry (the rows in Separability), that particular entry must be one. 
We also assume is a simplicial matrix defined as below. 

Definition 5.2 (simplicial matrix). A nonnegative matrix W is simplicial if no row in W can be 
represented in the convex hull of the remaining rows in W. 

The next lemma shows that without loss of generality we may assume W is simplicial. 

Lemma 5.3. If a nonnegative matrix M has a separable factorization AW of inner- dimension at 
most r then there is one in which W is simplicial. 

Proof: Suppose W is not simplicial, and let the j^^ row W^ be in the convex hull of the remaining 
rows. Then we can represent W^ = u^W where n is a nonnegative vector with \u\i = 1 and the 
j^^ coordinate is 0. 

Now modify A as follows. For each row A^ in A that has a non-zero j^^ coordinate, we zero 
out the j*'^ coordinate and add A-j u to the row A^' . At the end the matrix is still nonnegative but 
whose j^^ column is all zeros. So delete the j^^ column and let the resulting n x (r — 1) matrix be 
A'. Let W' be the matrix obtained by deleting the j^^ row of W . Then by construction we have 
M = A'W' . Now we claim A' is separable. 

Since A was originally separable, for each column index i there is some row, say the f{iY^ row, 
that has a non-zero entry in the i^^ column and zeros everywhere else, li i ^ j then by definition 



^More realistically, the word may appear in other topics only with negligible property instead of zero probability. 
This is allowed in our noise-tolerant algorithm later. 
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the above operation does not change the f{i) row of A. li i = j the j index is deleted at the 
end. In either case the final matrix A' satisfies the separability condition. 

Repeating the above operation for all violations of the simplicial condition we end with a 
separable factorization of M (again with inner-dimension at most r) where W is simplicial. ■ 

Theorem 5.4. There is an algorithm that runs in time polynomial in n, m and r and given a 
matrix M outputs a separable factorization with inner- dimension at most r (if one exists). 

Proof: We can apply Lemma 5.3 and assume without loss of generality that there is a factorization 
M = AW where A is separable and W is simplicial. The separability condition implies that every 
row of W appears among the rows of M. Thus W is hiding in plain sight in M; we now show how 
to find it. 

Say a row is a loner if (ignoring other rows that are copies of M^) it is not in the convex 
hull of the remaining rows. The simplicial condition implies that the rows of M that correspond to 
rows of W are loners. 

Claim 5.5. A row AP is a loner iff AI^ is equal to some row 

Proof: Suppose (for contradiction) that a row in iVP is not a loner and but it is equal to some 
row W^. Then there is a set S of rows of M so that is in their convex hull and furthermore for 
all j' G S, is not equal to MK Thus there is a nonnegative vector w G SR" that is at the j*'^ 
coordinate and positive on indices in S such that M = . 

Hence AW = = W'', but u^A must have unit £i-norm (because \\u\\-^ = 1, all rows of A 
have unit £i-norm and are all nonnegative), also u^A is non-zero at position j'. Consequently VF* 
is in the convex hull of the other rows of W, which yields a contradiction. 

Conversely if a row is not equal to any row in W, we conclude that is in the convex 
hull of the rows of W. Each row of W appears as a row of A (due to the separability condition). 
Hence is not a loner because is in the convex hull of rows of M that are equivalent to 
itself. ■ 

Using linear programming, we can determine which rows are loners. Due to separability 
there will be exactly r different loner rows, each corresponds to one of the W^. Thus we are able 
to recover W' that is equal to W after permutation over rows. We can compute a nonnegative A' 
such that A'W' = M, and such solution A' is necessarily separable (since it is just equal to A after 
permutation over columns). ■ 

5.1 Adding Noise 

In any practical setting the data matrix M will not have an exact NMF of low inner dimension 
since its entries are invariably subject to noise. Here we consider how to extend our separability- 
based algorithm to work in presence of noise. We assume that the input matrix M' is obtained 
by perturbing each row of M by adding a vector of ^i-norm at most e, where M has a separable 
factorization of inner-dimension r. Alternatively, ||M'* — < e for all i. Notice that the case 

in which the separability condition is only approximately satisfied is a subcase of this: If for each 
column there is some row in which that column's entry is at least 1 — e and the sum of the other 
row entries is less than e then the matrix M' will satisfy the condition stated above. (Note that 
M, A, W have been scaled as discussed above.) 

Our algorithm will require one more condition - namely, we require the unknown matrix W to 
be "robustly" simplicial instead of just simplicial. 
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Definition 5.6 (a-robust simplicial) . We call W a-robust simplicial if no row in W has £i distance 
smaller than a to the convex hull of the remaining rows in W. (Here all rows have miit £i-norm.) 



Recall from Lemma 5.3 that the simplicial condition can be assumed without loss of general- 
ity under separability. In general a-robust simplicial condition does not follow from separability. 
However, any reasonable generative model would surely posit that the matrix W — whose columns 
after all represents distributions — satisfies the condition above. For instance, if columns of W are 
picked randomly from the unit ii ball then after normalization a is more than 1/10. Regardless of 
whether or not one self-identifies as a bayesian, it seems reasonable that any suitably generic way 
of picking column vectors would tend to satisfy the a-robust-simplicial property. 

Theorem 5.7. Suppose M = AW where A is separable and W is a-robust simplicial. Let e satisfy 
20e/a + 13e < a. Then there is a polynomial time algorithm that given M' such that for all rows 
||M'* — < e, finds a nonnegative matrix factorization A'W of the same inner dimension such 

that the ii norm of each row of M' — A'W' is at most lOe/a + 7e. 

Proof: Separability implies that for any column index i there is a row /(i) in A whose only nonzero 
entry is in the i*'^ column. Then M^^^^ = W^ and consequently ||M'-^*^*) — < e. Let us call 

these rows M'-^^*^ for all i the canonical rows. From the above description the following claim is 
clear since the rows of M can be expressed as a convex combination of W^^s. 

Claim 5.8. Every row M'^ has ii-distance at most 2e to the convex hull of canonical rows. 

Proof: 



M'^ - Aj^kM' 



fik) 



k=l 



< \\M'^ - M^'IL + 



k=l 



+ 



k=l 



and we can bound the right hand side by 2e. ■ 

Next, we show how to find the canonical rows. For a row M'^ , we call it a robust-loner if upon 
ignoring rows whose £i distance to M'^ is less than d = 5e/a + 2e, the £i-distance of M'^ to the 
convex hull of the remaining rows is more than 2e. Note that we can identify robust-loner rows 
using linear programming. 

The following two claims establish that a row of M'^ is a robust-loner if and only if it is close 
to some row W^. 

Claim 5.9. If M'^ has distance more than d+e to all of the W^ 's, then it cannot be a robust loner. 

Proof: Such an M'^ has distance at least d to each of the canonical rows. The previous claim 
shows M'^ is close to the convex hull of the canonical rows and thus by definition it cannot be a 
robust-loner. ■ 



Claim 5.10. All canonical rows are robust-loners. 
Proof: Since 

-w'\\^<e, when we check if M'*^"^' is a robust-loner (using linear program- 
ming), we leave out of consideration all rows that have £i-distance at most be/a + e to W^. In 
particular, this omits any row M'^ such that = Yl]i=i ^j,kW^ and Aj^i > 1 — 5e/a. All remain- 
ing rows have Aj^i < 1 — be/ a, and hence the £i distance of W^ to conv{W\W^) is at least a (by 
the a-robust simplicial property), we conclude that the distance between W^ and the convex hull 
of remaining M-^'s must be at least 5e/a * a = 5e. Since M' is close to M the ^i-distance between 



22 



M'-^(*) and the convex hull of remaining rows M'-^'s must be at least 5e — 2e = 3e. Therefore M'*^"^' 
is a robust-loner. ■ 

The previous claim implies that each robust-loner row is within ^i-distance d -|- e to some VF* 
and conversely, for every there is at least one robust-loner row that is close to it. Since the 
£i-distances between VF*'s are at least 4((i -|- e), we can apply distance based clustering on the 
robust-loner rows: place two robust-loner rows into the same cluster if and only if these rows 
are within £i-distance at most 2{d + e). Clearly we will obtain r clusters, one corresponding 
to each of the M^*'s. Choose one row from each of the cluster, and using similar argument as 
Claim 5.8 we deduce that every row of M' is within 2{d -|- e) -|- e = lOe/a -|- 7e to the convex hull 
of the rows we selected. Therefore these rows form a nonnegative W and we can find A' so that 
\\M'i - {A'W'YW^ ^ + all j. U 

6 Approximate Nonnegative Matrix Factorization 

Here we consider the case in which the given matrix does not have an exact low-rank NMF but 
rather can be approximated by a nonnegative factorization with small inner-dimension. We refer 
to this as Approximate NMF. Unlike the algorithm in Theorem 5.7, the algorithm here works with 
general nonnegative matrix factorization: we do not make any assumptions on matrices A and W. 
Throughout this section we will use || ||^ to denote the Froebenius norm, || to denote the spectral 
norm and |||| applied to a vector will denote the standard Euclidean norm. 

Theorem 6.1. Let M be an n x m nonnegative matrix such that there is a factorization AW 
satisfying \\M — AW\\p < e ||M||^, where A and W are nonnegative and have inner-dimension r. 
There is an algorithm that computes A' and W satisfying 

\\M - A'W'Wp < 0(6^/^1/^) ||M||^ 

in time 2P°^y('''°s(V^))poly(n,m). 

Note that the matrix M need not have low rank, but we will be able to assume M has rank 
at most r without loss of generality: Let M' be the best rank at most r approximation (in terms 
of Frobenius norm) to M. This can be computed using a truncated singular value decomposition 
(see e.g. [I-]). Since A and W have inner-dimension r, we get: 

Claim 6.2. \\M' - M\\p < \\M - AW\\p 

Throughout this section, we will assume that the input matrix M has rank at most r - since 
otherwise we can compute M' and solve the problem for M' . Then using the triangle inequality, 
any good approximation to M' will also be a good approximation to M. 

Throughout this section, we will use the notation At to denote the t*'* column of A and to 
denote the t*'* row of W. Note that is a row vector so we will frequently use AtW*' to denote 
an outer-product. Next, we apply a simple re-normalization that will allow us to state the main 
steps in our algorithm in a more friendly notation. 

Lemma 6.3. We can assume without loss of generality that for all t 

\\W^\\ = l (2) 
Ptil < (l + e)||M||^ (3) 

and furthermore ||^||^ < (1 + e) 
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Proof: We can write AW = Ylt=i-^tW*. So we may scale At,W^ to ensure that 



Next, since A and W are nonnegative we have ||j4M^||p > 
(1 + e) ||M||^ and this imphes the first condition in the lemma. 
Next we observe 



\At\\ \\W^ 



1. 



and \\AW\\p < 



\\AW\\ 



n m 

EE 

i=i j=i 



t=i 



t=i i=i j=i 



U2 



\Al 



where the inequality follows because all entries in A and W are nonnegative, and the last equality 
follows because = 1. ■ 

Note that this lemma immediately implies that HM^Hj? < \/r. 

The intuition behind our algorithm is to decompose the unknown matrix W as the sum of two 
parts: W = Wq + Wi. The first part Wq is responsible for how good AW is as an approximation 
to M (i.e., \\M — ^VFoll^ is small) but could be negative; the second part W\ has little effect on 
the approximation but is important in ensuring the sum Wq + W\ is nonnegative. The algorithm 
will find good approximations to Vt'o, W\. 

What arc Wq, Wil Since removing W\ has little effect on how good AW is as an approximation 
to M, this matrix should be roughly the projection of W onto the "less significant" singular vectors 
of A. Namely, let the singular value decomposition of A be 



(4) 



t=\ 



and suppose that o\ > (72.... > (Jr- Let to be the largest t for which \at\ > S\\M\\p (where (5 is a 
constant that is polynomially related to r and e and will be specified later). Then set 



to 



Wo = Y,iytvf)W; Wi 



t=i 



Lemma 6.4. ||M - ^W^oIIf < ^ ||M||p + 6^/r \\M\\p 

Proof: By the triangle inequality ||M — ^Wo||^ < 
ELto+i (^tiutvf)W, so we have 



E (^tvJ)W. 

t=to + l 

\\M-AW\\p + \\AWi\ 



(5) 



F- 



Also AWi 



\\AW, 



i\\f 



E at{utvJ)W 
t=to+l 



< 



t=to + l 



\W\\p<S\\M\\pV^, 



where the last inequality follows because < s/r and the spectral norm of J2t=to+i^''^t'"t') 

one. ■ 

Next, we establish a lemma that will be useful when searching for (an approximation to) Wo- 

Lemma 6.5. There is an r x m matrix Wq such that each row is in the span of the rows of M and 

which satisfies \\Wq — Wo\\p < 2e/6. 

Proof: Consider the matrix A~^ = Ylt=i ^''^tuj ■ Thus A~^ is a pseudo-inverse of the truncated 
SVD of A. Note that Wo = A+AW and the spectral norm P+II2 is at most l/{6\\M\\p). Then 
we can choose Wq = A'^M. Clearly, each row of Wq is in the span of the rows of M. Furthermore, 
we have 

1 „ 2e 



1^0 - Wo\\p = \\A+{M - AW)\\p < ||M - ^W^ll^ < 



6\\M\ 



■2e\\M\\p < J 
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Lemma 6.6. There is an algorithm that in time 2P°'y('^^°g(i/'))poly(n, m) finds Wl^',W[ and A' 
such that W^' + W{>0, A' >0 and 

\\M - A'iW;; + W[)\\^ < \\A\\p + e\\M\\p + \\M\\p). 

Proof: We use exhaustive enumeration to find a close approximation to the matrix Wq of Lemma 6.5, 
and then we use convex programming to find W[, A': 

The exhaustive enumeration is simple: try all vectors that lie in some ei-net in the span of the 
rows of M, where ei = e/6 . Such an ei-net is easily enumerated in the provided time since the 
row vectors are smaller than = ^/r and their span is r-dimensional. Contained in this net 

there must an Wq such that ||A+M — Wg < ei. Using Lemma 6.5, ||Wo — VI/0II2 — ^^Z*^' ^'^ 
triangle inequality implies \\Wq — Wq\\p < 2e/5 + ei < 4e/6. 

Next, we give a method to find suitable substitutes W{, A' for Wi, A respectively so that Wq + 
W[>0 and A'{Wq + W{) is a good approximation to M. 

Let us assume we know the vectors Vi appearing in the SVD expression (4) and This 
is easy to guarantee since we can enumerate over all choices of the fj's (which are unit vectors in 

using a suitable e2-net where €2 = min{^, 0.1}. Also, ||^||^ is a scalar value that can be easily 
guessed within multiplicative factor 1.01. 

Let W[ = Z he the optimal solution to the following convex program: 



min \\A\\lJ2hm'' + ^^\\M\\l jZ H^f (6) 

t=l t=to+l 

s.t. + Z> 0. (7) 

This is optimization problem is convex since the constraints are linear and the objective function 
is quadratic but convex. (In fact this optimization problem can be separated into m smaller convex 
programs because the constraints between different columns of are independent). 

When the vectors we enumerated (denoted as {v^}) are close enough to the true values {vi}, that 
is, when Yll=i \Wi ~ — value of the objective function after substituting v 

by v' can only change by at most 0{j2 II^IIf + ^<^^ 11-^ I If)- From now on we work with the true 
values of {vi}. The Claim below and arguments after will still be true although the vectors are not 
exact. 

Claim 6.7. The optimal value of this convex program is at most \\A\\p + r6^ ||M||^). 

Proof: We prove that W[ = W - Wq = (Wq - Wq) + T^i is a feasible solution and that the 
objective value of this solution is the value claimed in the lemma. 

Since Wi = X]t=to+i(^*'^i")^ o^^y contributes to the second term of the objective function in 
(6), we can upper bound the objective as 

\\Wo - W;,'\\l \\A\\l + {\\Wo - W;,'\\p + lli^ill^)'^^ ||M||^ . 

The proof is completed because ||VFo — VFq ||^ = 0(f ) and ||Wi||^ < ||1^||f = V^- * 

After solving the convex program, we obtaine a candidate W{. Let W' = Wq + W[. To get 
the right A' (since W' is fixed) we can find the A' that minimizes ||M — by solving a 

least-squares problem. Clearly such an A' satisfies ||M - A'{Wq + W[)\\p < \\M - A(Wq + W[)\\p 
and the latter quantity is bounded by ||M - AWqIIf + WM^o - Wll)\\p, + \\AW[\\p . 
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Lemma 6.4 bounds the first term and Lemma 6.5 bounds the second term. The square of the 
last term is bounded by the objective function of the convex program. ■ 

Finally, by choosing 5 = ^we get A', W = W^+W[ such that \\M - A'W'\\p < 0{(^l'^r^!^) ||M 

Concluding Remarks 

Here, we initiated a rigorous study of nonnegative matrix factorization. Our hardness result rules 
out significant improvements over our worst-case results for fixed inner-dimension r. We believe that 
our poly(m, n, r)-time algorithm for finding separable factorizations may point the way for future 
work. What other plausible conditions can one impose on the factors in real-life applications? We 
also hope our work promotes further theoretical study of nonnegative rank. 

This work is part of a broader agenda of bringing greater rigor to the analysis of algorithms used 
in machine learning. Currently, heuristic approaches are popular because the solution concepts are 
believed to be intractable. Our results, for example our algorithm for NMF under the separability 
condition, raise hope that sometimes the solution concepts may not be intractable after all. 
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A Extended Discussion 



Here we explain an application of NMF in detail. Perhaps the best approach is to contrast the NMF 
problem with a more well-known matrix factorization, the singular value decomposition (SVD): A 
n X m matrix M can be written as M = '^iO'iUivf where the set {ui}i and the set {vi}i are 
orthonormal and cri > (T2.... > 0",- > (see e.g. [l-]). In a number of applications, we imagine 
that the columns of M represent examples and the rows of M represent observed variables. In 
the context of information retrieval one often forms M as a "term- by-document" matrix where 
the {i,jy^ entry in M is the frequency of occurrence of the i^^ term in the j*^ document in the 
database. The SVD of M (e.g. in Latent Semantic Indexing (LSI) [9]) is often interpreted as a 
method to extract "topics" in the database: The set of vectors {ui}i (in a truncated SVD) is the 
subspace that contains the maximum variance of the documents, and projecting columns of M (i.e. 
documents) onto this basis is interpreted as a decomposition of each document into constituent 
topics. Documents can then be compared based on an inner-product in this space. 

In some sense, the decomposition into "topics" generated via SVD is inconsistent with our 
intuitive notion of what a topic is. The vectors {ui}i have both positive and negative values - these 
vectors are orthogonal. For example, imagine some documents are about cars and some others are 
about the weather. These "topics" would both be negatively correlated with mentioning the word 
"elephant" - i.e. documents about either topic are unlikely to use this word. What this means is 
that when we compute the similarity of a pair of documents, the documents will be judged to be 
more similar if both omit the word "elephant". But this is not consistent with our intuitive model, 
and would lead to spurious latent relationships. We would expect similarity to be based on positive 
occurrences only. 

Hofman introduced a related approach (Probabilistic Latent Semantic Indexing [16]) in which 
each document is normalized to be a distribution on words, and the goal is to compute a small set 
of r topics (which are each distributions on words) and represent each document as a distribution 
on topics. This is equivalent (after an appropriate renormalization) to computing a nonnegative 
factorization of the term- by-document matrix M into AW, where the columns of A represent a 
set of r topics and each column of W expresses the corresponding document as a distribution on 
topics. The advantage of requiring this factorization to be nonnegative is that in Hofman's PLSI 
documents are judged to be similar based on words that they both contain. In LSI, documents 
can also be judged to be similar based on words they both omit. Arguably, Hofman's model is 
more consistent with our intuition and maybe this helps explain why (computational issues aside) 
a nonnegative factorization is, in many cases, preferred over an unrestricted one. 
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